Section 2 - Principal Components Analysis

PCA is a linear transformation:

\begin{equation*} Z = XQ \end{equation*}

If a vector is multiplied by a scalar, its direction stays the same; however, its length can change.

If a vector is multiplied by a matrix, it can "point" in a new direction (it gets rotated).

Input data X is of shape (N, D).

The transformation matrix Q is of shape (D, D).

The transformed data Z is of shape (N, D).

Multiplying a vector by a matrix can be seen as rotating the coordinate system of the vector.
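As a quick illustration (a minimal sketch with made-up data; the rotation matrix Q below is just one example of an orthonormal transformation), the shapes work out as follows:

import numpy as np

# Made-up data: N = 5 samples, D = 3 dimensions
N, D = 5, 3
X = np.random.randn(N, D)

# Example orthonormal Q: a rotation of the first two coordinates by 45 degrees
theta = np.pi / 4
Q = np.array([[np.cos(theta), -np.sin(theta), 0],
              [np.sin(theta),  np.cos(theta), 0],
              [0,              0,             1]])

Z = X @ Q          # transformed data
print(Z.shape)     # (5, 3) -- same shape as X, expressed in rotated coordinates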

PCA is often used for dimensionality reduction.

The information carried by a variable can be assessed by measuring its variance.

A variable having variance equal to zero provides no information.

Another advantage of PCA is that it de-correlates the variables.

High correlation between two variables means that one of them can be disregarded, since they provide (approximately) the same information.
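A hedged illustration on synthetic data: np.corrcoef measures how strongly two variables move together, and a correlation close to 1 suggests the second variable is nearly redundant.

import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=1000)
b = 2 * a + 0.05 * rng.normal(size=1000)   # b is almost a scaled copy of a
c = rng.normal(size=1000)                  # c is unrelated to a

print(np.corrcoef(a, b)[0, 1])   # close to 1 -> b adds little new information
print(np.corrcoef(a, c)[0, 1])   # close to 0 -> c carries genuinely new information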

The goal of PCA is to find a matrix Q that removes the correlations between the variables, so that the most crucial information is concentrated in the smallest possible number of dimensions.

PCA can help reduce the impact of noise.

Assuming that the noise is not a major factor in the data, it is fair to state that the principal components capturing the noise will be somewhere down the list. Thus, truncating the PCA transformation acts as a denoising pre-processing step.

Furthermore, by reducing the number of parameters, the risk of fitting the model to the noise (overfitting) is significantly reduced.

The latent variables Z can be interpreted as the underlying causes of X.

That explains why they are (largely) uncorrelated.

Very important: uncorrelated does not necessarily mean independent, unless the distribution is Gaussian.
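A small synthetic example of that distinction (for a standard normal x, x and x^2 are uncorrelated even though x^2 is completely determined by x):

import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=100_000)
y = x ** 2                       # fully determined by x, so clearly not independent

print(np.corrcoef(x, y)[0, 1])   # close to 0: uncorrelated, yet dependent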

In PCA, the linear transformation can be applied both ways:

\begin{equation*} Z = XQ \end{equation*}

\begin{equation*} X = ZQ^{-1} \end{equation*}
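A minimal round-trip sketch, assuming Q is invertible (which holds for the orthonormal Q used by PCA, where the inverse is simply the transpose):

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 3))
Q = np.linalg.qr(rng.normal(size=(3, 3)))[0]   # any orthonormal Q will do

Z = X @ Q                        # forward transform
X_back = Z @ np.linalg.inv(Q)    # inverse transform (equals Z @ Q.T here)
print(np.allclose(X, X_back))    # True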

PCA Derivation

Covariance of X is given as: \begin{equation*} Cov(X)= \frac{1}{N} (X - \mu _{X})^{T} (X - \mu _{X}) \end{equation*}

Cov(X) is a D by D matrix.
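The same formula written out in NumPy (a sketch on random data; note that np.cov divides by N - 1 by default, so bias=True is passed to match the 1/N version above):

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))                 # shape (N, D)

mu = X.mean(axis=0)                           # per-dimension means, shape (D,)
cov_manual = (X - mu).T @ (X - mu) / X.shape[0]

cov_np = np.cov(X, rowvar=False, bias=True)   # rowvar=False: variables are columns
print(np.allclose(cov_manual, cov_np))        # True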

Eigenvalues and Eigenvectors

Eigenvectors (v) are non-zero vectors that only change by a scalar factor when a linear transformation is applied to them.

\begin{equation*} Av = \lambda v \end{equation*}

The equation can be re-written as:

\begin{equation*} (A - \lambda I) v = 0 \end{equation*}

It has a non-zero solution if and only if the determinant is zero: \begin{equation*} \det( A - \lambda I ) = 0 \end{equation*}

λ is known as the eigenvalue, or characteristic root, associated with the eigenvector v.

The goal is to find the eigenvalues and their corresponding eigenvectors.

There will be D eigenvalues (because Cov(X) is of shape D by D), and since a covariance matrix is positive semi-definite, their values will be greater than or equal to zero.
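A minimal numerical check of the eigenvalue equation, using a small symmetric matrix as a stand-in for Cov(X):

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])               # symmetric, like a covariance matrix

eig_vals, eig_vecs = np.linalg.eig(A)    # columns of eig_vecs are the eigenvectors
for i in range(len(eig_vals)):
    v = eig_vecs[:, i]
    print(np.allclose(A @ v, eig_vals[i] * v))   # True: A v = lambda v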

After finding the eigenvectors and their corresponding eigenvalues, the eigenvalues are sorted in descending order and collected into a diagonal matrix:

\begin{equation*} \Lambda = \begin{pmatrix} \lambda_{1} & 0 & \cdots & 0\\ 0 & \lambda_{2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_{D} \end{pmatrix} \end{equation*}

The matrix V is constructed by stacking the eigenvectors as columns: \begin{equation*} V = \begin{pmatrix} | & | & & | \\ v_{1} & v_{2} & \cdots & v_{D}\\ | & | & & | \end{pmatrix} \end{equation*}

The matrix V is orthonormal, which means that the dot product of two different eigenvectors is zero and each eigenvector has unit length, so V^{T}V = I.

Writing the eigenvalue equation for all eigenvectors of Cov(X) at once: \begin{equation*} Cov(X)V = V \Lambda \end{equation*}

Furthermore, covariance of Z is given as: \begin{equation*} Cov(Z)= \frac{1}{N} (Z - \mu _{Z})^{T} (Z - \mu _{Z}) = \frac{1}{N} (XQ - \mu _{X}Q)^{T} (XQ - \mu _{X}Q) = \frac{1}{N} Q^{T}(X - \mu _{X})^{T} (X - \mu _{X})Q = Q^{T} Cov(X) Q \end{equation*}

By choosing Q = V we obtain: \begin{equation*} Cov(Z) = V^{T} Cov(X) V = V^{T} V \Lambda = \Lambda \end{equation*}

The off-diagonal elements are equal to zero, which means that no dimension is correlated with any other dimension. The values of λ were sorted in descending order, so the first column of Z has the largest variance and thus carries the most information.
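A quick numerical check of this result on synthetic correlated data (np.linalg.eigh is used because Cov(X) is symmetric; its eigenvalues come back in ascending order, so they are re-sorted to descending):

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))   # correlated columns
cov_X = np.cov(X, rowvar=False)

eig_vals, V = np.linalg.eigh(cov_X)
order = np.argsort(eig_vals)[::-1]        # descending eigenvalue order
V = V[:, order]

Z = X @ V
cov_Z = np.cov(Z, rowvar=False)
print(np.round(cov_Z, 6))                 # (approximately) diagonal, variances sorted descending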

PCA Objective Function

PCA minimizes the reconstruction objective function, given as: \begin{equation*} J = \sum_{n=1}^{N} | x(n) - x_{reconstructed}(n) |^{2} \end{equation*}

Perfect reconstruction (using all D components) leads to zero error, because: \begin{equation*} X_{reconstructed} = X Q Q^{-1} = X \end{equation*} since: \begin{equation*} Q Q^{-1} = I \end{equation*}

Furthermore, Q is orthonormal so: \begin{equation*} Q^{-1} = Q^{T} \end{equation*}

Finally, we can re-state the problem as: \begin{equation*} X_{reconstructed} = X Q Q^{T} \end{equation*}

Reconstruction

The data can be reconstructed from only K < D eigenvectors, leading to a non-zero reconstruction error. That is the trade-off for lower dimensionality.

\begin{equation*} X_{reconstructed} = X Q_{K} Q_{K}^{T} \end{equation*}

In that scenario the objective function is re-written as (the right-hand side is the squared Frobenius norm): \begin{equation*} J = \sum_{n=1}^{N} | x(n) - Q_{K} Q_{K}^{T} x(n) |^2 = \| X - XQ_{K} Q_{K}^{T} \|_{F}^2 \end{equation*}
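A sketch of the truncated reconstruction and its error on synthetic, centred data (Q_K here is simply the first K columns of the eigenvector matrix sorted by descending eigenvalue):

import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))
X = X - X.mean(axis=0)                      # centre the data

eig_vals, V = np.linalg.eigh(np.cov(X, rowvar=False))
V = V[:, np.argsort(eig_vals)[::-1]]        # sort eigenvectors by descending eigenvalue

K = 2
Q_K = V[:, :K]                              # keep the top K components
X_rec = X @ Q_K @ Q_K.T                     # reconstruction from K components

J = np.linalg.norm(X - X_rec, 'fro') ** 2   # squared Frobenius norm of the residual
print(J)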

Implementation from scratch

Based on Sebastian Raschka's blog post: http://sebastianraschka.com/Articles/2014_pca_step_by_step.html

PCA step by step

1) Load a dataset without labels. The data will be d-dimensional.

2) Calculate the mean of every dimension across all samples in the dataset.

3) Compute the covariance matrix for the whole dataset.

4) Compute the eigenvectors (denoted e_1, e_2, ..., e_d) and the corresponding eigenvalues (lambda_1, lambda_2, ..., lambda_d).

5) Sort the eigenvectors by their eigenvalues (descending) and choose the k with the largest eigenvalues.

6) Form a d by k matrix W, where every column represents an eigenvector.

7) Use the matrix W to transform the data onto the new subspace:

\begin{equation*} y = W^{T} x \end{equation*}

where:

x is a d-by-1 vector representing one sample, and y is the transformed k-by-1 sample in the new subspace.

Generate the data

A 3-dimensional dataset drawn from a multivariate Gaussian distribution.

More information can be found:

https://faculty.math.illinois.edu/~r-ash/Stat/StatLec21-25.pdf


In [1]:
import numpy as np
np.random.seed(123456789)

In [2]:
# Number of samples per class
samples_per_class = 200

In [3]:
# Generate the first "class" samples
# Mean
mu_vec1 = np.array([0, 0, 0])

# Covariance
cov_mat1 = np.array([[1,0,0],[0,1,0],[0,0,1]])

# Samples
class1_sample = np.random.multivariate_normal(mu_vec1, cov_mat1, samples_per_class).T

In [4]:
class1_sample.shape


Out[4]:
(3, 200)

In [5]:
# Generate the second "class" samples
# Mean
mu_vec2 = np.array([1, 1, 1])

# Covariance
cov_mat2 = np.array([[1,0,0],[0,1,0],[0,0,1]])

# Samples
class2_sample = np.random.multivariate_normal(mu_vec2, cov_mat2, samples_per_class).T

In [6]:
class2_sample.shape


Out[6]:
(3, 200)

In [7]:
# Visualizing the data
from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from mpl_toolkits.mplot3d import proj3d
%pylab inline


Populating the interactive namespace from numpy and matplotlib

In [8]:
fig = plt.figure(figsize = (12, 12))
ax = fig.add_subplot(111, projection = '3d')
plt.rcParams['legend.fontsize'] = 10   
ax.plot(class1_sample[0,:], class1_sample[1,:], class1_sample[2,:], 'o', 
        markersize = 8, color = 'blue', alpha = 0.5, label = 'class1')
ax.plot(class2_sample[0,:], class2_sample[1,:], class2_sample[2,:], '^', 
        markersize = 8, alpha = 0.5, color = 'red', label = 'class2')

plt.title('Samples for class 1 and class 2')
ax.legend(loc = 'upper right')


Out[8]:
<matplotlib.legend.Legend at 0x2721add4e80>

In [9]:
# Gathering the dataset together
all_samples = np.concatenate((class1_sample, class2_sample), 
                             axis = 1)

In [10]:
# Computing the d-dimensional mean vector
mean_x = np.mean(all_samples[0,:])
mean_y = np.mean(all_samples[1,:])
mean_z = np.mean(all_samples[2,:])

mean_vector = np.array([[mean_x],[mean_y],[mean_z]])
mean_vector


Out[10]:
array([[0.4890709 ],
       [0.58096305],
       [0.40024479]])

In [11]:
# Computing the Scatter Matrix
# Intializing the scatter matrix as zeros
# It will be of size 3 by 3 in this specific example
dimensionality = all_samples.shape[0]

scatter_matrix = np.zeros((dimensionality, dimensionality))
for i in range(all_samples.shape[1]):
    scatter_matrix += (all_samples[:,i].reshape(dimensionality,1) - mean_vector).dot((all_samples[:,i].reshape(dimensionality,1) - mean_vector).T)
    
scatter_matrix


Out[11]:
array([[499.4167083 , 135.24138206,  89.9976988 ],
       [135.24138206, 438.61830474, 102.19588451],
       [ 89.9976988 , 102.19588451, 476.82651834]])

In [12]:
# Computing the Covariance Matrix
cov_mat = np.cov([all_samples[0,:], all_samples[1,:], all_samples[2,:]])
cov_mat


Out[12]:
array([[1.25167095, 0.33895083, 0.22555814],
       [0.33895083, 1.099294  , 0.25613004],
       [0.22555814, 0.25613004, 1.19505393]])

In [13]:
# Calculating Eigenvectors and Eigenvalues from the scatter matrix
eig_val_sc, eig_vec_sc = np.linalg.eig(scatter_matrix)
for i in range(len(eig_val_sc)):
    eigvec_sc = eig_vec_sc[:,i].reshape(1, dimensionality).T

In [14]:
eig_val_sc


Out[14]:
array([691.88036512, 398.80615223, 324.17501403])

In [15]:
eig_vec_sc


Out[15]:
array([[-0.63913277, -0.57920898,  0.50599038],
       [-0.55580333, -0.10689217, -0.82441295],
       [-0.5315938 ,  0.80814047,  0.25360799]])

In [16]:
# Calculating Eigenvectors and Eigenvalues from the covariance matrix
eig_val_cov, eig_vec_cov = np.linalg.eig(cov_mat)
for i in range(len(eig_val_cov)):
    eigvec_cov = eig_vec_cov[:,i].reshape(1, dimensionality).T

In [17]:
eig_val_cov


Out[17]:
array([1.734036  , 0.99951417, 0.81246871])

In [18]:
eig_vec_cov


Out[18]:
array([[-0.63913277, -0.57920898,  0.50599038],
       [-0.55580333, -0.10689217, -0.82441295],
       [-0.5315938 ,  0.80814047,  0.25360799]])

The eigenvectors are the same, but the eigenvalues differ by a constant factor: the scatter matrix is (N - 1) times the covariance matrix, because np.cov divides by N - 1.


In [19]:
# Pairing eigenvalues with their eigenvectors before sorting
# Creating tuples of the form (eigenvalue, eigenvector)
eig_pairs = [(np.abs(eig_val_sc[i]), eig_vec_sc[:,i]) for i in range(len(eig_val_sc))]
eig_pairs


Out[19]:
[(691.88036512278, array([-0.63913277, -0.55580333, -0.5315938 ])),
 (398.80615222880954, array([-0.57920898, -0.10689217,  0.80814047])),
 (324.1750140261993, array([ 0.50599038, -0.82441295,  0.25360799]))]

In [20]:
# Sort the (eigenvalue, eigenvector) tuples from high to low
eig_pairs.sort(key=lambda x: x[0], reverse = True)
eig_pairs


Out[20]:
[(691.88036512278, array([-0.63913277, -0.55580333, -0.5315938 ])),
 (398.80615222880954, array([-0.57920898, -0.10689217,  0.80814047])),
 (324.1750140261993, array([ 0.50599038, -0.82441295,  0.25360799]))]

In [21]:
# Choosing two eigenvectors with the largest eigenvalues
k_number_of_dimensions = 2
top_k_pairs = eig_pairs[:k_number_of_dimensions]
matrix_w = np.hstack((top_k_pairs[0][1].reshape(dimensionality,1), top_k_pairs[1][1].reshape(dimensionality,1)))

In [22]:
# Transforming the samples onto the new subspace (2 - dimensional)
transformed = matrix_w.T.dot(all_samples)

In [23]:
plt.figure(figsize = (12, 12))
plt.plot(transformed[0, 0:samples_per_class], transformed[1,0:samples_per_class], 
         'o', 
         markersize = 7, 
         color = 'blue', 
         alpha = 0.5, 
         label = 'class1')
plt.plot(transformed[0, samples_per_class:], transformed[1,samples_per_class :], 
         '^', 
         markersize = 7, 
         color = 'red', 
         alpha = 0.5, 
         label = 'class2')
plt.xlim([-4, 4])
plt.ylim([-4, 4])
plt.xlabel('x_values')
plt.ylabel('y_values')
plt.legend()
plt.title('Transformed samples with class labels')

plt.show()


Sklearn implementation

Loading the data


In [24]:
import pandas as pd
import numpy as np
from sklearn.utils import shuffle

In [25]:
def getKaggleMNIST():
    # Column 0 is the label
    # Columns 1-784 are the pixel data, with values 0..255
    # .as_matrix() was removed in newer pandas; .to_numpy() is the current equivalent
    train = pd.read_csv('./data/Section 2/train.csv').to_numpy().astype(np.float32)
    train = shuffle(train)

    Xtrain = train[:-1000, 1:] / 255
    Ytrain = train[:-1000, 0].astype(np.int32)

    Xtest  = train[-1000:, 1:] / 255
    Ytest  = train[-1000:, 0].astype(np.int32)
    return Xtrain, Ytrain, Xtest, Ytest

In [26]:
Xtrain, Ytrain, Xtest, Ytest = getKaggleMNIST()

In [27]:
Xtrain.shape


Out[27]:
(41000, 784)

PCA from Sklearn


In [28]:
from sklearn.decomposition import PCA

In [29]:
pca = PCA()
reduced = pca.fit_transform(Xtrain)

In [30]:
reduced.shape


Out[30]:
(41000, 784)

In [31]:
# Variance of each column
pd.DataFrame(data=reduced).var()


Out[31]:
0      5.140932e+00
1      3.785782e+00
2      3.248402e+00
3      2.840127e+00
4      2.584887e+00
5      2.274046e+00
6      1.730318e+00
7      1.525671e+00
8      1.460173e+00
9      1.241773e+00
10     1.109470e+00
11     1.086073e+00
12     8.993473e-01
13     8.937030e-01
14     8.351696e-01
15     7.827291e-01
16     6.973970e-01
17     6.776641e-01
18     6.260919e-01
19     6.087995e-01
20     5.672881e-01
21     5.362141e-01
22     5.085567e-01
23     4.820611e-01
24     4.682386e-01
25     4.431360e-01
26     4.280157e-01
27     4.102884e-01
28     3.917249e-01
29     3.622263e-01
           ...     
754    2.132367e-32
755    2.132174e-32
756    2.132378e-32
757    2.132417e-32
758    2.132414e-32
759    2.132420e-32
760    2.132300e-32
761    2.132392e-32
762    2.132355e-32
763    2.132405e-32
764    2.132379e-32
765    2.132368e-32
766    2.132413e-32
767    2.132416e-32
768    2.132421e-32
769    2.132378e-32
770    2.132237e-32
771    2.132421e-32
772    2.132339e-32
773    2.132302e-32
774    2.132385e-32
775    2.132383e-32
776    2.132386e-32
777    2.132342e-32
778    2.130706e-32
779    2.113551e-32
780    1.628526e-32
781    6.442654e-33
782    2.321912e-33
783    2.815062e-34
Length: 784, dtype: float64

In [32]:
import matplotlib.pyplot as plt
plt.figure(figsize = (12, 12))
plt.scatter(reduced[:, 0], 
            reduced[:, 1], 
            s = 100, 
            c = Ytrain, 
            alpha = 0.5)
plt.show()



In [33]:
plt.figure(figsize = (12, 12))
plt.plot(pca.explained_variance_ratio_)
plt.show()



In [34]:
# In order not to lose too much information, the number of dimensions the data is reduced to
# should be chosen so that the cumulative explained variance ratio is above 95%.
cumulative = []
last = 0
for v in pca.explained_variance_ratio_:
    cumulative.append(last + v)
    last = cumulative[-1]
plt.figure(figsize = (12, 12))
plt.plot(cumulative)
plt.show()
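The same cumulative curve can be obtained with np.cumsum, and (as an alternative sketch, assuming the pca and Xtrain objects from the cells above) sklearn's PCA also accepts a float n_components, in which case it keeps just enough components to reach that fraction of the variance:

import numpy as np
from sklearn.decomposition import PCA

cumulative = np.cumsum(pca.explained_variance_ratio_)   # same values as the loop above
print(np.argmax(cumulative >= 0.95) + 1)                 # number of components needed for >= 95%

# Equivalent: let sklearn pick the number of components that keeps 95% of the variance
pca_95 = PCA(n_components=0.95)
reduced_95 = pca_95.fit_transform(Xtrain)
print(reduced_95.shape)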